Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Enhance Falco syscall events triggering and reliability #220

Merged
merged 5 commits into from
Sep 20, 2024

Conversation

prezha
Copy link
Contributor

@prezha prezha commented Sep 16, 2024

foreword/background:
wanted to use falco event-generator for stability/regression testing, but hit some issues that this pr aims to fix
i appreciate this is a bigger pr with lots of changes for i did not have a time to break it down into individual commits/PRs - the intention here is to offer the improvements i made and i'm, of course, happy to provide any additional clarification needed

changelog:

  • added cleanup to several events, preventing issues like inode starvation if run in loop
  • prevent creating zombie processes (more details in the comment below)
  • added usage of "random" temp files/folders names where missing, to avoid eventual overlaps and errors
  • added missing binaries to Dockerfile (container image) required for some events to be able to trigger at all
  • fixed binaries' calls so that events are able to trigger (instead of just error due to eg, command invalid arguments, permission denied or other exit errors) - around 20 "action errors" fixed, so now we have all 60+ events triggering
  • fixed issue with CombinedServerClient() func where call might end before server was able to catch a message sent by client (hence not triggering event)
  • fixed event names so they match actual rule names, so respective tests do not fail any more
  • added prerequisites checking for some events so some can be reasonably skipped instead of erroring (eg, binary presence, in-container, on-host, [not] privileged container, etc.)
  • added missing error checking/handling
  • added missing timeout to command calls that could hang for longer, speeding up a bit overall execution time
  • removed a need for bash from the final container image - sh is now sufficient
  • removed deprecated DisallowedSSHConnection rule (ref)
  • removed redundant ContactCloudMetadataServiceFromContainer (falco would likely trigger "Contact EC2 Instance Metadata Service From Container" instead)
  • fixed build issues on macos by excluding linux-specific events
  • continue if a test fails, don't terminate execution
  • don't silence info logs if debug log is emitted

bonus:

  • cleaned up Makefile and disabled CGO_ENABLED for building the binary
  • eliminated all 37 linter (golangci-lint v1.61.0) errors and complains
  • used specific base container image tags in Dockerfile instead of "latest", so that trivy v0.55.1 vulnerability scanner finds zero issues with os-level packages in the final container image (alpine 3.20.3)
  • bumped go to v1.23.1 and updated dependencies, so that vulnerability scanners (trivy v0.55.1 and govulncheck v1.1.3) do not find any known issues with the code/dependencies (like GO-2024-2687 - HTTP/2 CONTINUATION flood in net/http, CVE-2023-45288 and CVE-2024-24786, etc.)
  • refactored so that how and what we name and log are a bit more consistent, simplified code flow in general
  • converted files with crlf to lf line-ending

testing performed

  • make test: all local tests are passing
  • event-generator test: all tests are now passing - tested in minikube v1.34 using KVM and containerd with falco modern_ebpf driver and with both privileged and non-privileged mode/container

/kind bug
/kind cleanup
/kind tests

/area events

@poiana
Copy link

poiana commented Sep 16, 2024

Welcome @prezha! It looks like this is your first PR to falcosecurity/event-generator 🎉

- Clean up several events to prevent inode starvation and other issues during repeated runs
- Use random temp file/folder names to avoid overlaps and errors
- Add missing binaries and fix their calls to enable all 60+ events to trigger correctly
- Fix CombinedServerClient() and event naming issues to improve test reliability
- Add prerequisite checks for some events and improve error handling
- Add timeouts to long-running commands and remove unnecessary bash dependency
- Remove deprecated and redundant Falco rules
- Fix build issues on macOS and ensure tests continue even if one fails

Bonus:
- Clean up Makefile, disable CGO_ENABLED, and fix all linter errors
- Use specific base image tags in Dockerfile to ensure a clean vulnerability scan
- Update Go and dependencies to address known security issues
- Refactor for consistency and code clarity, convert CRLF to LF line endings

Signed-off-by: Predrag Rogic <[email protected]>
@leogr
Copy link
Member

leogr commented Sep 16, 2024

Hey @prezha

Thank you for this contribution! 🤩
The review will take a bit of time due to the size of this PR, but I wanted to let you know that we will take a look.

🙏

@alacuku
Copy link
Member

alacuku commented Sep 16, 2024

Hey @prezha, thanks for the contribution! I will take a look in the coming days!

@prezha
Copy link
Contributor Author

prezha commented Sep 16, 2024

hey @leogr and @alacuku , you are most welcome and thank you both for considering the pr, i appreciate it!

Signed-off-by: Predrag Rogic <[email protected]>
@prezha
Copy link
Contributor Author

prezha commented Sep 19, 2024

hey @leogr and @alacuku, just a heads up: after some additional testing i noticed that the event-generator creates an increasing number of zombie processes (which can eventually create issues in the loop mode), so i amended the pr to prevent that

this is due to how we handle killing sleep process in PtraceAntiDebugAttempt() and PtraceAttachedToProcess() (where we need to also wait for the killed process to exit before moving on), as well as how we are limiting command execution time using the timeout (which i replaced now with the context.WithTimeout() and exec.CommandContext(), so they are handled properly)

Copy link
Member

@leogr leogr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @prezha

This PR is fantastic:star_struck: you did a terrific job, thank you!

I did a first review pass, and only found very minor issues (see comments below), otherwise it seems excellent to me.

I'll give it a try with a real Falco installation as soon as I can, then I will approve.

🙏

@@ -21,14 +21,21 @@ import (
)

var _ = events.Register(
CreateHiddenFileOrDirectory,
CreateHiddenFilesOrDirectories,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also rename the file name to create_hidden_file_or_directories.go, please?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i've renamed it to create_hidden_files_or_directories.go

@@ -24,16 +24,21 @@ import (
)

var _ = events.Register(
LaunchIngressRemoteFileCopyToolsInsideContainer,
LaunchIngressRemoteFileCopyToolsInContainer,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed the file accordingly

)

var _ = events.Register(PotentialLocalPrivillegeEscalation)
var _ = events.Register(PotentialLocalPrivilegeEscalationViaEnvironmentVariablesMisuse)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file name does not match.
Note the typo 👇
potential_local_privillege_escalation_via_env_var_misuse.go

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed file to potential_local_privilege_escalation_via_environment_variables_misuse.go

@@ -25,12 +27,16 @@ var _ = events.Register(
)

func ReadEnvironmentVariableFromProcFiles(h events.Helper) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

File name missing _files

@leogr
Copy link
Member

leogr commented Sep 20, 2024

Tried with Falco 0.38.3

$ sudo ./event-generator test syscall
INFO sleep for 100ms                               action=syscall.PacketSocketCreatedInContainer
WARN action skipped                                action=syscall.PacketSocketCreatedInContainer reason="only applicable to containers"
INFO sleep for 100ms                               action=syscall.CreateHardlinkOverSensitiveFiles
INFO action executed                               action=syscall.CreateHardlinkOverSensitiveFiles
INFO test passed                                   action=syscall.CreateHardlinkOverSensitiveFiles rule="Create Hardlink Over Sensitive Files" source=syscall
INFO sleep for 100ms                               action=syscall.NetcatRemoteCodeExecutionInContainer
WARN action skipped                                action=syscall.NetcatRemoteCodeExecutionInContainer reason="only applicable to containers"
INFO sleep for 100ms                               action=syscall.FindAwsCredentials
INFO action executed                               action=syscall.FindAwsCredentials
INFO test passed                                   action=syscall.FindAwsCredentials rule="Find AWS Credentials" source=syscall
INFO sleep for 100ms                               action=syscall.MountLaunchedInPrivilegedContainer
WARN action skipped                                action=syscall.MountLaunchedInPrivilegedContainer reason="only applicable to containers"
INFO sleep for 100ms                               action=syscall.CreateSymlinkOverSensitiveFiles
INFO action executed                               action=syscall.CreateSymlinkOverSensitiveFiles
INFO test passed                                   action=syscall.CreateSymlinkOverSensitiveFiles rule="Create Symlink Over Sensitive Files" source=syscall
INFO sleep for 100ms                               action=syscall.RemoveBulkDataFromDisk
INFO action executed                               action=syscall.RemoveBulkDataFromDisk
INFO test passed                                   action=syscall.RemoveBulkDataFromDisk rule="Remove Bulk Data from Disk" source=syscall
INFO sleep for 100ms                               action=syscall.DisallowedSSHConnectionNonStandardPort
ERRO action error                                  action=syscall.DisallowedSSHConnectionNonStandardPort error="context deadline exceeded"
INFO sleep for 100ms                               action=syscall.DropAndExecuteNewBinaryInContainer
WARN action skipped                                action=syscall.DropAndExecuteNewBinaryInContainer reason="only applicable to containers"
INFO sleep for 100ms                               action=syscall.RunShellUntrusted
INFO spawn as "httpd"                              action=syscall.RunShellUntrusted args="^helper.RunShell$"
INFO sleep for 100ms                               action=helper.RunShell as=httpd
INFO action executed                               action=helper.RunShell as=httpd
INFO test passed                                   action=syscall.RunShellUntrusted rule="Run shell untrusted" source=syscall
INFO sleep for 100ms                               action=syscall.LaunchSuspiciousNetworkToolOnHost
WARN action skipped                                action=syscall.LaunchSuspiciousNetworkToolOnHost reason="nmap executable file not found in $PATH"
INFO sleep for 100ms                               action=syscall.ClearLogActivities
INFO action executed                               action=syscall.ClearLogActivities
INFO test passed                                   action=syscall.ClearLogActivities rule="Clear Log Activities" source=syscall
INFO sleep for 100ms                               action=syscall.DebugfsLaunchedInPrivilegedContainer
WARN action skipped                                action=syscall.DebugfsLaunchedInPrivilegedContainer reason="only applicable to containers"
INFO sleep for 100ms                               action=syscall.ExecutionFromDevShm
INFO action executed                               action=syscall.ExecutionFromDevShm
INFO test passed                                   action=syscall.ExecutionFromDevShm rule="Execution from /dev/shm" source=syscall
INFO sleep for 100ms                               action=syscall.FilelessExecutionViaMemfdCreate
INFO action executed                               action=syscall.FilelessExecutionViaMemfdCreate
INFO test passed                                   action=syscall.FilelessExecutionViaMemfdCreate rule="Fileless execution via memfd_create" source=syscall
INFO sleep for 100ms                               action=syscall.ReadSensitiveFileUntrusted
INFO action executed                               action=syscall.ReadSensitiveFileUntrusted
INFO test passed                                   action=syscall.ReadSensitiveFileUntrusted rule="Read sensitive file untrusted" source=syscall
INFO sleep for 100ms                               action=syscall.JavaProcessClassFileDownload
INFO spawn as "java"                               action=syscall.JavaProcessClassFileDownload args="^helper.CombinedServerClient$"
INFO sleep for 100ms                               action=helper.CombinedServerClient
ERRO action error                                  action=syscall.JavaProcessClassFileDownload error="context deadline exceeded"
INFO sleep for 100ms                               action=syscall.DirectoryTraversalMonitoredFileRead
INFO action executed                               action=syscall.DirectoryTraversalMonitoredFileRead
INFO test passed                                   action=syscall.DirectoryTraversalMonitoredFileRead rule="Directory traversal monitored file read" source=syscall
INFO sleep for 100ms                               action=syscall.DetectReleaseAgentFileContainerEscapes
WARN action skipped                                action=syscall.DetectReleaseAgentFileContainerEscapes reason="only applicable to containers"
INFO sleep for 100ms                               action=syscall.PotentialLocalPrivilegeEscalationViaEnvironmentVariablesMisuse
INFO action executed                               action=syscall.PotentialLocalPrivilegeEscalationViaEnvironmentVariablesMisuse
ERRO action error                                  action=syscall.PotentialLocalPrivilegeEscalationViaEnvironmentVariablesMisuse error="context deadline exceeded"
INFO sleep for 100ms                               action=syscall.ReadSensitiveFileTrustedAfterStartup
INFO spawn as "httpd"                              action=syscall.ReadSensitiveFileTrustedAfterStartup args="^syscall.ReadSensitiveFileUntrusted$ --sleep 6s"
INFO sleep for 6s                                  action=syscall.ReadSensitiveFileUntrusted as=httpd
INFO action executed                               action=syscall.ReadSensitiveFileUntrusted as=httpd
INFO test passed                                   action=syscall.ReadSensitiveFileTrustedAfterStartup rule="Read sensitive file trusted after startup" source=syscall
INFO sleep for 100ms                               action=syscall.SystemUserInteractive
INFO run as "daemon"                               action=syscall.SystemUserInteractive cmdArgs="[]" cmdName=/bin/login user=daemon
INFO test passed                                   action=syscall.SystemUserInteractive rule="System user interactive" source=syscall
INFO sleep for 100ms                               action=syscall.PtraceAntiDebugAttempt
INFO action executed                               action=syscall.PtraceAntiDebugAttempt
INFO test passed                                   action=syscall.PtraceAntiDebugAttempt rule="PTRACE anti-debug attempt" source=syscall
INFO sleep for 100ms                               action=syscall.PtraceAttachedToProcess
INFO test passed                                   action=syscall.PtraceAttachedToProcess rule="PTRACE attached to process" source=syscall
INFO sleep for 100ms                               action=syscall.SearchPrivateKeysOrPasswords
INFO action executed                               action=syscall.SearchPrivateKeysOrPasswords
INFO test passed                                   action=syscall.SearchPrivateKeysOrPasswords rule="Search Private Keys or Passwords" source=syscall

I got just 3 failures:

  • action=syscall.DisallowedSSHConnectionNonStandardPort error="context deadline exceeded"
  • action=syscall.JavaProcessClassFileDownload error="context deadline exceeded"
  • action=syscall.PotentialLocalPrivilegeEscalationViaEnvironmentVariablesMisuse error="context deadline exceeded"
    Any clue for these?

Anything else worked as expected 🚀

@prezha
Copy link
Contributor Author

prezha commented Sep 20, 2024

thanks @leogr for the review! i'll fix the files names

as for the failures you got:

Tried with Falco 0.38.3

$ sudo ./event-generator test syscall
INFO sleep for 100ms                               action=syscall.PacketSocketCreatedInContainer

...


I got just 3 failures:

* `action=syscall.DisallowedSSHConnectionNonStandardPort error="context deadline exceeded"`
* `action=syscall.JavaProcessClassFileDownload error="context deadline exceeded"`
* `action=syscall.PotentialLocalPrivilegeEscalationViaEnvironmentVariablesMisuse error="context deadline exceeded"`
  Any clue for these?

yep, i think it's because the default 100ms "sleep" sometimes is not sufficient to capture the event falco emits (and match it to the expected one in the test), so adding --sleep=1s worked for me reliably in my local tests (in minikube), can you please try with that

@prezha
Copy link
Contributor Author

prezha commented Sep 20, 2024

i've renamed all suggested files and few others for consistency

@prezha
Copy link
Contributor Author

prezha commented Sep 20, 2024

just realized that gh workflows also need updated go version, which i added in the last 501ed2b commit

@leogr
Copy link
Member

leogr commented Sep 20, 2024

yep, i think it's because the default 100ms "sleep" sometimes is not sufficient to capture the event falco emits (and match it to the expected one in the test), so adding --sleep=1s worked for me reliably in my local tests (in minikube), can you please try with that

Unfortunately, more --sleep did not work for me (I'm using lima on an Apple Silicon). I guess the sleep value can't be the root issue since it affects just the sleeping time between different actions. I've also tried using --test-timeout=2m, but they fail anyway.

That said, I believe we can dig into this issue in a follow-up PR. This is already great and big enough 😅

@poiana
Copy link

poiana commented Sep 20, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: leogr, prezha

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@poiana
Copy link

poiana commented Sep 20, 2024

LGTM label has been added.

Git tree hash: 7df09df584861eb1b1c2442fb8f0874d6f958de3

@poiana poiana merged commit 490b030 into falcosecurity:main Sep 20, 2024
4 checks passed
@leogr leogr added this to the v0.12.0 milestone Oct 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants